[QuantizationModifier] NVFP4 bugfix -- fused layer update on all modules #1869

brian-dellabetta · 2025-09-26T15:29:53Z

SUMMARY:
#1772 introduced a bug when running NVFP4 quantization schemes. The call to update_fused_layer_weight_global_scales needs to be run on Attention and MLP layers, which are not included in targets consisting of quantizable layers inside Attention/MLP. This PR fixes that by running update_fused_layer_weight_global_scales on every module instead of the targeted ones, which is ok because the call is idempotent and will only modify if the modules have NVFP4 schemes. This is only a problem in QuantizationModifier, AWQ cannot be used with NVFP4.

TEST PLAN:
Confirmed that the working vs. broken global scales are mismatched because the update is never run:

model.layers.0.self_attn.k_proj.weight_global_scale -- working 9600.0, broken 12992.0
model.layers.0.self_attn.q_proj.weight_global_scale -- working 9600.0, broken 9600.0
model.layers.0.self_attn.v_proj.weight_global_scale -- working 9600.0, broken 12160.0

And these changes resolve the regression:
Before

vllm (pretrained=/home/dsikka/llm-compressor/examples/quantization_w4a4_fp4/Qwen3-30B-A3B-NVFP4,dtype=auto,max_model_len=4096,add_bos_token=True), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8135|±  |0.0107|
|     |       |strict-match    |     5|exact_match|↑  |0.8097|±  |0.0108|

After

vllm (pretrained=/home/brian-dellabetta/projects/llm-compressor/Qwen3-30B-A3B-NVFP4,dtype=auto,max_model_len=4096,add_bos_token=True), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8620|±  |0.0095|
|     |       |strict-match    |     5|exact_match|↑  |0.8575|±  |0.0096|

Signed-off-by: Brian Dellabetta <[email protected]>

github-actions · 2025-09-26T15:30:01Z

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite, please only add the label once the PR is code complete and local testing has been performed.

gemini-code-assist · 2025-09-26T15:30:06Z

Summary of Changes

Hello @brian-dellabetta, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves a critical bug within the QuantizationModifier that impacted NVFP4 quantization schemes. The issue stemmed from an incorrect application scope for updating fused layer weight global scales, which led to performance degradation in models using NVFP4. By broadening the application of this update function to all model modules, the PR ensures accurate global scale calibration, thereby restoring and improving the quantization quality and model performance, as evidenced by improved GSM8K benchmark scores.

Highlights

Bug Fix for NVFP4 Quantization: Addresses a regression introduced in [Multi-modifier] Support scoped application of quantization config/status #1772 where the update_fused_layer_weight_global_scales function was not correctly applied to Attention and MLP layers during NVFP4 quantization, leading to incorrect global scales and degraded model performance.
Expanded Scope of Global Scale Update: The update_fused_layer_weight_global_scales function is now applied to all modules in the model, rather than only targeted quantizable layers. This ensures correct application to Attention and MLP layers, leveraging the function's idempotent nature to safely update scales across the entire model.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This pull request addresses a bug in QuantizationModifier related to NVFP4 quantization schemes. The change ensures update_fused_layer_weight_global_scales is called on all modules, not just the targeted ones, which is necessary for Attention and MLP layers. The fix is sound, as the function is idempotent. My review includes a minor suggestion to add a progress bar for better user experience, consistent with the surrounding code.

src/llmcompressor/modifiers/quantization/quantization/base.py

Signed-off-by: Brian Dellabetta <[email protected]>

shanjiaz

Looks good to me! Thanks for fixing!

src/llmcompressor/modifiers/quantization/quantization/base.py

run fused layer update on all modules

8b203e2

Signed-off-by: Brian Dellabetta <[email protected]>

brian-dellabetta requested review from dsikka, rahul-tuli, fynnsu and shanjiaz September 26, 2025 15:29

gemini-code-assist bot reviewed Sep 26, 2025

View reviewed changes

src/llmcompressor/modifiers/quantization/quantization/base.py Outdated Show resolved Hide resolved

brian-dellabetta added the ready When a PR is ready for review label Sep 26, 2025

brian-dellabetta changed the title ~~[QuantizationModifier] bugfix -- fused layer update on all modules~~ [QuantizationModifier] NVFP4 bugfix -- fused layer update on all modules Sep 26, 2025

tqdm labels

a40d3bf

Signed-off-by: Brian Dellabetta <[email protected]>

brian-dellabetta requested a review from shubhra September 26, 2025 15:38

shanjiaz approved these changes Sep 26, 2025

View reviewed changes

fynnsu approved these changes Sep 26, 2025

View reviewed changes

brian-dellabetta merged commit a824136 into main Sep 26, 2025
8 checks passed

brian-dellabetta deleted the bdellabe/quant-modifier-nvfp4-fix branch September 26, 2025 17:27

dsikka reviewed Sep 26, 2025

View reviewed changes

src/llmcompressor/modifiers/quantization/quantization/base.py Show resolved Hide resolved

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[QuantizationModifier] NVFP4 bugfix -- fused layer update on all modules #1869

[QuantizationModifier] NVFP4 bugfix -- fused layer update on all modules #1869

Uh oh!

brian-dellabetta commented Sep 26, 2025 •

edited

Loading

Uh oh!

github-actions bot commented Sep 26, 2025

Uh oh!

gemini-code-assist bot commented Sep 26, 2025

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

Uh oh!

shanjiaz left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

[QuantizationModifier] NVFP4 bugfix -- fused layer update on all modules #1869

[QuantizationModifier] NVFP4 bugfix -- fused layer update on all modules #1869

Uh oh!

Conversation

brian-dellabetta commented Sep 26, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions bot commented Sep 26, 2025

Uh oh!

gemini-code-assist bot commented Sep 26, 2025

Summary of Changes

Highlights

Footnotes

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

shanjiaz left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

brian-dellabetta commented Sep 26, 2025 •

edited

Loading